
Cythonize away some perf hot spots #709


Open
wants to merge 17 commits into main

Conversation

leofang (Member) commented Jun 13, 2025

Description

Less aggressive version of #677.

Based on the summary in #658 (comment), this PR optimizes the identified hotspots to bring us much closer to our reference (CuPy). The optimization strategy is to

  • implement everything in pure Python (we do this today for cuda.core)
  • once hotspots are identified, we lower them to Cython
  • most importantly, we still call cuda.bindings Python APIs in the Cython code, so as to avoid introducing CTK as a build-time dependency (and therefore having to ship two separate packages cuda-core-cu11 and cuda-core-cu12)

In other words, this PR tries to strike a reasonable balance between performance, ease of development, and ease of deployment, without introducing any breaking changes (a rough sketch of the approach follows below).
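For illustration, a minimal sketch of this strategy. The class layout and names here are hypothetical and simplified, not taken from the actual PR; only driver.cuEventCreate and CUevent_flags follow the existing cuda.bindings driver API.

    # Illustrative sketch: a Cython cdef class whose hot path still calls the
    # cuda.bindings *Python* API, so building this module needs no CTK headers.
    from cuda.bindings import driver

    cdef class Event:
        cdef object _handle      # CUevent object returned by cuda.bindings
        cdef int _device_id

        @classmethod
        def _from_device(cls, int device_id):
            # Skip the Python-level __init__ machinery on the hot path.
            cdef Event self = Event.__new__(Event)
            # Still a Python-level call into cuda.bindings; the driver API is
            # resolved at run time, not at build/link time.
            err, handle = driver.cuEventCreate(
                driver.CUevent_flags.CU_EVENT_DISABLE_TIMING)
            # (error checking elided in this sketch)
            self._handle = handle
            self._device_id = device_id
            return self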

Preliminary data (an assumed setup for these timings is sketched after the numbers):

  • cuda.core main branch
In [5]: %timeit e = dev.create_event()
4.65 μs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [5]: %timeit s = dev.create_stream()
7.7 μs ± 9.98 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  • this PR
In [6]: %timeit e = dev.create_event()
1.11 μs ± 6.91 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [4]: %timeit s = dev.create_stream()
4.12 μs ± 14 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  • cupy
In [8]: %timeit e = cp.cuda.Event(disable_timing=True)
749 ns ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [14]: %timeit s = cp.cuda.Stream(non_blocking=True)
3.8 μs ± 8.54 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang leofang added this to the cuda.core parking lot milestone Jun 13, 2025
@leofang leofang self-assigned this Jun 13, 2025
@leofang leofang added the enhancement, P0 (High priority - Must do!), and cuda.core labels Jun 13, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jun 13, 2025
copy-pr-bot (bot) commented Jun 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang linked an issue Jun 25, 2025 that may be closed by this pull request
leofang (Member, Author) commented Jun 25, 2025

/ok to test 227d9c1

leofang (Member, Author) commented Jun 25, 2025

/ok to test 48de1b3

@leofang leofang closed this Jun 25, 2025
oleksandr-pavlyk (Contributor) commented

/ok to test

copy-pr-bot (bot) commented Jun 30, 2025

/ok to test

@oleksandr-pavlyk, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

oleksandr-pavlyk (Contributor) commented

/ok to test 0d573fc

To fix test failures with CTK 11.8 and driver 535.247.01, only attempt to query _ctx_handle if _device_id is None.

Ensure that context handle is set in Stream.context property
oleksandr-pavlyk (Contributor) commented

/ok to test 30de720

@leofang leofang changed the title WIP: Cythonize away some perf hot spots Cythonize away some perf hot spots Jun 30, 2025
leofang (Member, Author) commented Jun 30, 2025

PR description updated. Thanks to @oleksandr-pavlyk, the CI failure was identified and fixed (the refactoring introduced a call to cuStreamGetCtx, which cannot be called during stream capture with the 12.2 driver). This is ready for review.
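Roughly, the fix amounts to a guard along these lines; this is a sketch based on the commit messages above, not the actual diff.

    # Hypothetical helper on Stream illustrating the fix: cuStreamGetCtx cannot
    # be issued while a stream capture is in progress (observed with the 12.2
    # driver), so only query it when the device id is not already known.
    def _get_device_id(self):
        if self._device_id is None:
            err, self._ctx_handle = driver.cuStreamGetCtx(self._handle)
            # ... derive self._device_id from the returned context (elided)
        return self._device_id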

@leofang leofang marked this pull request as ready for review June 30, 2025 19:12
oleksandr-pavlyk (Contributor) commented

/ok to test e95c4b1

kkraus14 previously approved these changes Jul 1, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Review in CCCL Jul 1, 2025
emcastillo left a comment

LGTM!

oleksandr-pavlyk (Contributor) commented

/ok to test f4531e5

self._mnff = Event._MembersNeededForFinalize(self, None)

options = check_or_create_options(EventOptions, options, "Event options")
def _init(cls, device_id: int, ctx_handle: Context, options=None):
Contributor commented:

Since Event contains native class members, perhaps adding __cinit__ to initialize them is appropriate. Something like

    def __cinit__(self):
        self._timing_disabled = False
        self._busy_waited = False
        self._device_id = -1

I also think it would be safe to set object class members to None.

This would ensure that Event.__new__(Event) would return an initialized struct.

leofang (Member, Author) commented:

I think Cython sets everything to None for us, but it'd be good to verify that this is indeed the case:

Cython additionally takes responsibility of setting all object attributes to None,

https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#initialisation-methods-cinit-and-init

Contributor commented:

Ok, let's leave object members out. Should I push adding Event.__cinit__?

leofang (Member, Author) commented:

I think the same section says all members are zero/null initialized?

Contributor commented:

Yes, but is it appropriate to zero initialize _device_id? Perhaps it does not matter much.
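For reference, a short sketch of the behavior being discussed, reusing the attribute names from the snippet above (the class body is illustrative, not the actual cuda.core implementation): Cython zero/None-initializes cdef class attributes before __cinit__ runs, so an explicit __cinit__ is mainly useful for non-zero sentinels such as _device_id = -1.

    cdef class Event:
        cdef:
            bint _timing_disabled   # zero-initialized (False) by Cython
            bint _busy_waited       # likewise False
            int _device_id          # zero-initialized to 0, hence the sentinel below
            object _mnff            # object attributes are set to None automatically

        def __cinit__(self):
            # Runs even for Event.__new__(Event); only the non-zero sentinel
            # needs to be set by hand.
            self._device_id = -1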

leofang (Member, Author) commented Jul 1, 2025

CI is green

oleksandr-pavlyk (Contributor) left a comment

LGTM!

Successfully merging this pull request may close these issues.

[FEA]: Faster initialization time for cuda.core abstractions